Uisge Beatha: Water of Life or Watered Down? Applied Multivariate Methods in Scotch Differentiation

Methods

Whisky Origin and Chemical Data
Sample_no Descriptor Distillery P S Cl K Ca Mn Fe Cu Zn Br Rb
1 Blend Bailie Nicol Jarvie 0.152 1.100 0.173 7.860 1.450 0.032 0.027 0.186 0.015 0.002 0.006
2a Blend Bells 0.653 1.580 0.238 4.930 1.400 0.019 0.110 0.242 0.021 0.005 0.003
3a Blend Chivas 0.375 0.809 0.193 4.310 1.220 0.019 0.044 0.196 0.007 0.003 0.002
4a Blend Dewars 0.121 1.160 0.157 3.200 1.140 0.011 0.050 0.189 0.018 0.003 0.003
5a Blend Johnnie Walker 0.326 1.090 0.180 5.480 0.526 0.018 0.103 0.286 0.020 0.002 0.002
6a Blend The Famous Grouse 0.145 0.615 0.097 2.740 0.416 0.009 0.050 0.208 0.007 0.002 0.001
7a Blend Whyte and Mackay 0.067 0.576 0.151 2.360 0.745 0.012 0.047 0.159 0.019 0.003 0.002
8a Blend William Grant 0.239 0.748 0.147 2.840 0.976 0.010 0.021 0.137 0.020 0.003 0.002
9a Counterfeit Unknown 1 0.089 4.060 0.066 0.336 1.240 0.007 0.154 0.085 0.038 0.005 0.001
10a Counterfeit Unknown 2 0.088 14.700 0.072 1.230 1.400 0.006 0.025 0.052 0.018 0.004 0.001
11a Counterfeit Unknown 3 0.279 15.900 0.083 0.811 1.360 0.006 0.057 0.038 0.016 0.002 0.002
12a Counterfeit Unknown 4 0.320 22.100 0.596 2.320 1.780 0.008 0.019 0.038 0.015 0.068 0.001
13a Counterfeit Unknown 5 0.120 26.100 0.071 2.370 1.630 0.010 0.082 0.187 0.194 0.012 0.005
14a Grain Grain matured 0.034 2.230 0.252 6.440 1.040 0.013 0.115 0.174 0.019 0.004 0.006
15a Grain Grain unmatured 0.084 5.530 0.113 3.250 1.350 0.012 0.076 0.164 0.046 0.010 0.003
16 Highland Glengoyne 1.040 5.570 0.343 24.200 0.857 0.023 0.197 1.251 0.041 0.004 0.016
17 Highland Glenmorangie 0.126 0.796 0.245 6.950 0.859 0.035 0.025 0.523 0.011 0.003 0.006
18a Island Bowmore 0.914 6.670 0.316 21.100 0.868 0.037 0.148 0.548 0.032 0.007 0.018
19 Island Bruichladdich 1.630 5.480 0.697 36.500 4.130 0.038 0.288 0.587 0.066 0.034 0.039
20a Island Bunnahabhain 2.240 7.540 1.350 36.200 2.120 0.051 0.184 0.580 0.057 0.014 0.037
21 Island Talisker 0.034 4.850 0.362 5.670 0.607 0.018 0.070 0.277 0.033 0.003 0.006
22a Lowland Auchentoshan 0.169 1.460 0.417 11.700 0.681 0.042 0.128 1.320 0.037 0.006 0.012
23a Lowland Glenkinchie 0.108 2.450 0.176 7.760 0.738 0.031 0.106 0.434 0.022 0.002 0.007
24 Speyside Balvenie 0.695 3.850 0.120 20.300 0.765 0.031 0.121 0.380 0.035 0.005 0.024
25 Speyside Craigellachie 0.096 0.819 0.177 6.110 0.633 0.024 0.094 0.239 0.025 0.005 0.006
26 Speyside Dufftown 0.883 4.640 0.130 14.000 1.050 0.030 0.078 0.533 0.024 0.002 0.014
27 Speyside Glen Elgin 0.115 1.350 0.404 9.270 1.400 0.031 0.046 0.195 0.029 0.006 0.009
28 Speyside Glenburgie 2.000 7.910 0.185 37.700 1.650 0.053 0.134 0.198 0.043 0.008 0.026
29 Speyside Glenfiddich 0.317 2.720 0.344 12.400 0.660 0.029 0.132 0.519 0.193 0.004 0.013
30 Speyside Glenrothes 0.953 4.110 0.399 16.700 1.830 0.041 0.137 1.030 0.029 0.007 0.014
31 Speyside Knockando 0.051 1.030 0.191 5.140 0.605 0.017 0.094 0.432 0.020 0.008 0.005
32 Speyside Linkwood 0.276 1.050 0.207 6.220 1.010 0.020 0.064 0.769 0.019 0.004 0.006

Initial viewing:

Assessment of Mahalanobis Distances

Summary of Surprising Observations
Distance Category Count %
Bottom_50% 22 68.8
50-75% 2 6.2
75-90% 0 0.0
90-95% 1 3.1
95-99% 4 12.5
Top_1% 3 9.4
Henze-Zirkler Test: HZ = 1.325, p < 0.001

Post log-transformation

Assessment of Log-transformed Mahalanobis Distribution

Summary of Surprise Categories
Distance Category Count %
Bottom_50% 17 53.1
50-75% 7 21.9
75-90% 5 15.6
90-95% 2 6.2
95-99% 1 3.1
Top_1% 0 0.0
Henze-Zirkler Test: HZ = 0.984, p = 0.137
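
The surprise categories above rest on squared Mahalanobis distances, computed before and after the log transform. A minimal NumPy sketch of that calculation follows. It is an illustration, not the analysis code: it uses only three elements (S, K, Cu) from the eight blends in the data table so that the covariance matrix is comfortably invertible, and the function name `mahalanobis_sq` is hypothetical.

```python
import numpy as np

# S, K and Cu for the eight blends (samples 1 to 8a) from the data table.
# Restricting to 3 of the 11 elements is an assumption made to keep the example small.
X = np.array([
    [1.100, 7.860, 0.186],
    [1.580, 4.930, 0.242],
    [0.809, 4.310, 0.196],
    [1.160, 3.200, 0.189],
    [1.090, 5.480, 0.286],
    [0.615, 2.740, 0.208],
    [0.576, 2.360, 0.159],
    [0.748, 2.840, 0.137],
])

def mahalanobis_sq(data):
    """Squared Mahalanobis distance of each row from the sample mean."""
    centered = data - data.mean(axis=0)
    cov = np.cov(data, rowvar=False)          # sample covariance (ddof = 1)
    inv_cov = np.linalg.inv(cov)
    # d_i^2 = c_i^T S^{-1} c_i, evaluated row by row
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

d2_raw = mahalanobis_sq(X)
d2_log = mahalanobis_sq(np.log(X))            # same calculation on log-transformed data

print(np.round(d2_raw, 2))
print(np.round(d2_log, 2))
```

Under multivariate normality the squared distances are approximately chi-square distributed with degrees of freedom equal to the number of variables; the percentile bins in the tables are presumably cut at quantiles of that reference distribution.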

Structure examination

Whisky Class Hotelling's T² Test Results
Comparison T² Statistic P-value
Counterfeit vs Speyside 7,083.10 0.009
Blend vs Speyside 213.48 0.026
Counterfeit vs Blend 928,150.00 0.009
Island vs Speyside 171.63 0.581
Grouped Hotelling's T² Test Results
Comparison T² Statistic P-value
Provenance vs Counterfeit 1,181.90 <0.001
Provenance vs Grain/Blend 137.44 <0.001
Grain/Blend vs Counterfeit 474.57 0.042
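
For reference, the usual two-sample form of Hotelling's T² is written out below. This is the textbook statistic under a pooled-covariance assumption. The text does not state whether the p-values in the tables came from the F approximation shown on the second line or from a permutation procedure, so that line should be read as an assumption.

```latex
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{x}_1 - \bar{x}_2)^{\top} S_p^{-1} (\bar{x}_1 - \bar{x}_2),
\qquad
S_p = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2}

F = \frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\,T^2 \;\sim\; F_{p,\; n_1 + n_2 - p - 1}
```

With p = 11 elements, the small classes leave very few degrees of freedom. For 5 counterfeits and 8 blends, n_1 + n_2 - 2 = 11, so the pooled covariance is at best barely invertible, which may contribute to the very large T² reported for that comparison.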

[Figure: Differentiation within the PCA Space. Panels: whiskies of Provenance, Counterfeit, and Blend/Grain]

Results:

PCA Analysis:

Standardized Log-data PCA Summary
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
Standard deviation 2.2650 1.5491 1.0852 0.8647 0.6564 0.6074 0.4985 0.4517 0.4210 0.2857 0.1818
Proportion of Variance 0.4664 0.2182 0.1071 0.0680 0.0392 0.0335 0.0226 0.0186 0.0161 0.0074 0.0030
Cumulative Proportion 0.4664 0.6846 0.7916 0.8596 0.8988 0.9323 0.9549 0.9735 0.9896 0.9970 1.0000
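
The proportion-of-variance rows follow directly from the standard deviations in the first row. The short sketch below reproduces that arithmetic from the tabulated values only; it does not rerun the PCA.

```python
# Standard deviations of the 11 principal components, copied from the summary table.
sdev = [2.2650, 1.5491, 1.0852, 0.8647, 0.6564, 0.6074,
        0.4985, 0.4517, 0.4210, 0.2857, 0.1818]

variances = [s ** 2 for s in sdev]            # component variances (eigenvalues)
total = sum(variances)                        # close to 11 for 11 standardized variables

proportion = [v / total for v in variances]
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: proportion={p:.4f} cumulative={c:.4f}")
```

Read this way, the first three components carry roughly 79% of the total variance, which is the figure in the cumulative row under PC3.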

Principal Component Loadings
First Three Components
PC1 PC2 PC3
P 0.311 0.118 −0.196
S 0.094 0.534 0.223
Cl 0.315 0.006 −0.422
K 0.404 −0.161 −0.131
Ca 0.149 0.475 −0.301
Mn 0.379 −0.239 −0.132
Fe 0.310 −0.004 0.469
Cu 0.327 −0.347 0.118
Zn 0.241 0.244 0.570
Br 0.183 0.454 −0.205
Rb 0.415 −0.074 0.088

K-means Assessment:

K-means Clustering Comparison
Metric K = 3 K = 4
Cluster Sizes 10, 6, 16 8, 2, 6, 16
Variance Explained 51.6% 58.9%
Avg Silhouette 0.30 0.28
Total Within SS 164.9 140.08
Between SS 176.1 200.92
Total SS 341 341
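
The "Variance Explained" row is the ratio of between-cluster to total sum of squares. The sketch below checks that arithmetic using only the numbers in the table; the variable names are illustrative.

```python
# Sums of squares copied from the k-means comparison table.
results = {
    3: {"within": 164.9, "between": 176.1},
    4: {"within": 140.08, "between": 200.92},
}
total_ss = 341.0   # consistent with 11 standardized variables x (32 - 1)

explained = {}
for k, ss in results.items():
    # Within and between sums of squares should partition the total.
    assert abs(ss["within"] + ss["between"] - total_ss) < 0.05
    explained[k] = ss["between"] / total_ss
    print(f"k={k}: variance explained = {explained[k]:.1%}")
```

Moving from k = 3 to k = 4 buys about seven percentage points of explained variance, while the average silhouette falls from 0.30 to 0.28.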

Silhouettes:

[Figure: Groupings in PC Space. Panels: k = 3 and k = 4]

PAM Assessment

K = 3

PAM Clustering Results (K=3)
Overall Avg Silhouette: 0.294
Cluster Size Medoid Avg Diss. Separation Avg Silhouette
1 17 4 2.234 2.304 0.344
2 5 10 2.962 2.653 0.138
3 10 18 2.277 2.304 0.287
PAM Clustering Results (K=4)
Overall Avg Silhouette: 0.149
Cluster Size Medoid Avg Diss. Separation Avg Silhouette
1 8 4 1.875 1.839 0.080
2 9 25 1.848 1.839 0.192
3 5 10 2.962 2.653 0.074
4 10 18 2.277 2.304 0.204
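
The silhouette widths reported in the k-means and PAM tables follow the usual definition s(i) = (b - a) / max(a, b), where a is the mean distance from a point to the rest of its own cluster and b is the smallest mean distance to any other cluster. A small pure-Python sketch on hypothetical one-dimensional toy data (not the whisky measurements) shows the calculation; `silhouette_widths` is an illustrative name.

```python
def silhouette_widths(points, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for 1-D points."""
    widths = []
    for i, (x, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        own = [abs(x - y) for j, (y, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:                      # singleton cluster: s(i) is set to 0
            widths.append(0.0)
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(x - y) for y, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        widths.append((b - a) / max(a, b))
    return widths

# Toy data (hypothetical): two tight, well separated groups.
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
widths = silhouette_widths(points, labels)
avg = sum(widths) / len(widths)
print([round(w, 3) for w in widths], round(avg, 3))
```

Well separated groups like these give widths close to 1. By the common rule of thumb, averages near 0.3, as in the tables, point to fairly weak cluster structure.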

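PAM and k-means tell a consistent story at k = 3. The overall average silhouette is nearly identical (0.294 for PAM, 0.30 for k-means), and both recover one small cluster (5 samples under PAM, 6 under k-means) alongside two larger ones, although the larger clusters differ in size (17 and 10 under PAM versus 16 and 10 under k-means). The methods diverge at k = 4: k-means loses little silhouette width (0.28), whereas PAM drops to 0.149 because it splits its 17-sample cluster into groups of 8 and 9 with average silhouettes of only 0.080 and 0.192, while leaving the sizes and medoids of the other two clusters unchanged. Both methods therefore favor k = 3.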

Hierarchical Clustering Assessment:

Replication of Hierarchical Clustering

Comparative quality:

Confusion Matrices

Consensus Clustering Results
K-means (k = 3) and hierarchical (Manhattan [complete and Ward] and Euclidean [Ward])
Predicted (rows) vs. Actual (columns): Counterfeit Grain/Blend Provenance
Counterfeit 5 1 0
Grain_Blend 0 9 7
Provenance 0 0 10
PAM Clustering
Predicted (rows) vs. Actual (columns): Counterfeit Grain/Blend Provenance
Counterfeit 5 0 0
Grain_Blend 0 10 7
Provenance 0 0 10
Correlation (1-r) Hierarchical Clustering
Predicted (rows) vs. Actual (columns): Counterfeit Grain/Blend Provenance
Counterfeit 5 1 0
Grain_Blend 0 9 10
Provenance 0 0 7
Euclidean (complete) Hierarchical Clustering
Predicted (rows) vs. Actual (columns): Counterfeit Grain/Blend Provenance
Counterfeit 5 5 0
Grain_Blend 0 5 15
Provenance 0 0 2

Quality Metrics

Clustering Method Performance Comparison
Global Confusion Matrix Metrics
Method Overall Acc. Average Acc. F1 (Macro) TNR (Macro) F1 (Micro) TNR (Micro) TPR (Micro)
Consensus 0.750 0.833 0.781 0.882 0.758 0.801 0.758
PAM 0.781 0.854 0.827 0.894 0.793 0.835 0.793
Correlation HC 0.656 0.771 0.704 0.836 0.677 0.735 0.677
Euclidean HC 0.375 0.583 0.404 0.711 0.452 0.479 0.452
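
The "Overall Acc." column can be recovered from the confusion matrices as the diagonal count divided by the number of samples. The sketch below does this for all four methods, using the matrices exactly as tabulated.

```python
# Confusion matrices copied from the tables. Rows and columns are both ordered
# Counterfeit, Grain/Blend, Provenance; the diagonal holds the agreements.
matrices = {
    "Consensus":      [[5, 1, 0], [0, 9, 7],  [0, 0, 10]],
    "PAM":            [[5, 0, 0], [0, 10, 7], [0, 0, 10]],
    "Correlation HC": [[5, 1, 0], [0, 9, 10], [0, 0, 7]],
    "Euclidean HC":   [[5, 5, 0], [0, 5, 15], [0, 0, 2]],
}

overall_acc = {}
for name, m in matrices.items():
    correct = sum(m[i][i] for i in range(3))      # samples on the diagonal
    total = sum(sum(row) for row in m)            # 32 samples in every matrix
    overall_acc[name] = correct / total
    print(f"{name}: {correct}/{total} = {overall_acc[name]:.3f}")
```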
PAM Class-wise Performance
Counts and Derived Quality Metrics
TP TN FP FN ACC_i MR_i PPV_i TPR_i TNR_i F_class
Counterfeit 5 27 0 0 1.000 0.000 1.000 1.000 1.000 1.000
Grain/Blend 10 15 7 0 0.781 0.219 0.588 1.000 0.682 0.741
Provenance 10 15 0 7 0.781 0.219 1.000 0.588 1.000 0.741
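
The class-wise counts and rates above can be derived from the PAM confusion matrix in a few lines. The sketch below assumes that rows are predicted clusters and columns are actual classes, an orientation inferred from the PAM cluster sizes (5, 17, 10) rather than stated in the text. It reproduces only the per-class and macro-averaged values, not the micro-averaged columns.

```python
# PAM confusion matrix from the text: rows = predicted cluster, columns = actual class
# (orientation is an assumption, inferred from the PAM cluster sizes).
classes = ["Counterfeit", "Grain/Blend", "Provenance"]
cm = [[5, 0, 0],
      [0, 10, 7],
      [0, 0, 10]]
n = sum(sum(row) for row in cm)

metrics = {}
for k, name in enumerate(classes):
    tp = cm[k][k]
    fp = sum(cm[k]) - tp                          # predicted k, actually another class
    fn = sum(row[k] for row in cm) - tp           # actually k, predicted another class
    tn = n - tp - fp - fn
    ppv = tp / (tp + fp)                          # precision
    tpr = tp / (tp + fn)                          # recall / sensitivity
    tnr = tn / (tn + fp)                          # specificity
    f1 = 2 * ppv * tpr / (ppv + tpr)
    metrics[name] = {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
                     "ACC": (tp + tn) / n, "PPV": ppv, "TPR": tpr,
                     "TNR": tnr, "F1": f1}

macro_f1 = sum(m["F1"] for m in metrics.values()) / len(metrics)
for name, m in metrics.items():
    print(name, {key: round(val, 3) for key, val in m.items()})
print("macro F1:", round(macro_f1, 3))
```

All of the errors come from seven provenance whiskies placed in the grain/blend cluster, which is why the two classes mirror each other in PPV and TPR.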

Discussion:

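PAM was the best-performing method (overall accuracy 0.781, macro F1 0.827). LDA could also be used to detect counterfeits if lower fidelity is acceptable. XTRF appears to be a sound method for discriminating malt, grain, and counterfeit whiskies across clustering methods, though care should be taken in choosing a distance for hierarchical clustering: results varied widely, both in counterfeit discrimination and in separating grain/blend whiskies from counterfeits. Euclidean distance with complete linkage, for example, reached an overall accuracy of only 0.375, against 0.656 for the correlation distance.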

More data, or different data, may be needed to truly discern provenance. XTRF has been used to sample metals, compounds imparted by wood, and so on; the aging process may be more distinguishable than region, as the main thing separating the classes is the chemicals ________, likely derived from whether the whisky is grain, malt (of provenance), or a counterfeit (likely additives). Interestingly, two Island whiskies (______ and _______) were very distinct from all other samples, as seen in PC space and in Euclidean clustering.